Using Yellow Brick to Explore and Model the Famous Iris Dataset

Exploration Notebook by: Nathan Danielsen Prema Damodaran

Review of the iris dataset


In [1]:
# read the iris data into a DataFrame
import pandas as pd
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
col_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
iris = pd.read_csv(url, header=None, names=col_names)

In [2]:
iris.head()


Out[2]:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa

Terminology

  • 150 observations (n=150): each observation is one iris flower
  • 4 features (p=4): sepal length, sepal width, petal length, and petal width
  • Response: iris species
  • Classification problem since response is categorical

Lightly Preprocess the Dataset


In [ ]:
# map each iris species to a number
iris['species_num'] = iris.species.map({'Iris-setosa':0, 'Iris-versicolor':1, 'Iris-virginica':2})

Import the Good Stuff


In [128]:
import yellowbrick as yb 
import matplotlib.pyplot as plt
%matplotlib inline

plt.rcParams['figure.figsize'] = (12, 8)

from yellowbrick.features.rankd import Rank2D
from yellowbrick.features.radviz import RadViz
from yellowbrick.features.pcoords import ParallelCoordinates

Feature Exploration with RadViz


In [26]:
# Specify the features of interest and the classes of the target

features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
classes = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']

# Extract the numpy arrays from the data frame
X = iris[features].as_matrix()
y = iris.species_num.as_matrix()

In [32]:
visualizer = RadViz(classes=classes, features=features)
visualizer.fit(X, y)      # Fit the data to the visualizer
visualizer.transform(X)   # Transform the data
visualizer.show()         # Draw/show/show the data


Setosas tend to have the largest septal-width. This can could be a great predictor.

Then, let's remove setosa from the training set and see fi we can find any differentiation between veriscolor and virginica.

Remove Setosa from the training set


In [141]:
# Specify the features of interest and the classes of the target
iris_subset = iris[iris.species_num!=0]
features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
classes = ['Iris-setosa','Iris-versicolor', 'Iris-virginica'] # but have to leave in more than two classes 

# Extract the numpy arrays from the data frame
X = iris_subset[features].as_matrix()
y = iris_subset.species_num.as_matrix()
assert y.shape[0] == X.shape[0]

In [142]:
visualizer = RadViz(classes=classes, features=features)
visualizer.fit(X, y)      # Fit the data to the visualizer
visualizer.transform(X)   # Transform the data
visualizer.show()         # Draw/show/show the data


Try the Covariance Visualizer


In [144]:
# Specify the features of interest and the classes of the target

features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
classes = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']

# Extract the numpy arrays from the data frame
X = iris[features].as_matrix()
y = iris.species_num.as_matrix()

visualizer = Rank2D(features=features, algorithm='covariance')

visualizer.fit(X, y)      # Fit the data to the visualizer
visualizer.transform(X)   # Transform the data
visualizer.show()         # Draw/show/show the data


This covariance chart is not intereptatble as they don't have labels. Also there shouldn't be half numbers in labels.

More Feature Exploration: Look at Parallel Coodinates for all Species


In [63]:
features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
classes = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']

# Extract the numpy arrays from the data frame
X = iris[features].as_matrix()
y = iris.species_num.values
assert y.shape[0] == X.shape[0]

In [64]:
visualizer = ParallelCoordinates(classes=classes, features=features)

visualizer.fit(X, y)      # Fit the data to the visualizer
visualizer.transform(X)   # Transform the data
visualizer.show()         # Draw/show/show the data


This clearly demonstrates the separation between features - especially petal_length and petal_width. One concern is that this demonstraction data might be obsured by the scaling of the features and add noise to the intepretation.

Feature Exploration: ParallelCoordinates with Scaling


In [65]:
from sklearn import preprocessing

In [66]:
features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
classes = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']

# Extract the numpy arrays from the data frame
X = iris[features].as_matrix()
X_scaled = preprocessing.scale(X)
y = iris.species_num.values
assert y.shape[0] == X.shape[0]

In [67]:
visualizer = ParallelCoordinates(classes=classes, features=features)

visualizer.fit(X_scaled, y)      # Fit the data to the visualizer
visualizer.transform(X)   # Transform the data
visualizer.show()


The scaled dataset makes it easier to see the separation between classes for each of the features.

*TODO - Add scaling option to PararalCordinates and potentially other visualizers

Now that we have some features, Let's Evaluate Classifiers

From the feature selection phase, we determined that petal_length and petal_width seem to have the best separation.


In [145]:
# Classifier Evaluation Imports

from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

from yellowbrick.classifier import ClassificationReport, ClassBalance

In [146]:
features = ['petal_length', 'petal_width']
classes = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']

# Extract the numpy arrays from the data frame
X = iris[features].as_matrix()
y = iris.species_num.as_matrix()
assert y.shape[0] == X.shape[0]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
assert y_train.shape[0] == X_train.shape[0]
assert y_test.shape[0] == X_test.shape[0]

In [147]:
# Instantiate the classification model and visualizer
bayes = GaussianNB()
visualizer = ClassificationReport(bayes, classes=classes)

visualizer.fit(X_train, y_train)  # Fit the training data to the visualizer
visualizer.score(X_test, y_test)  # Evaluate the model on the test data
g = visualizer.show()             # Draw/show/show the data


---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-147-4c8e0ef2dbcf> in <module>()
      4 
      5 visualizer.fit(X_train, y_train)  # Fit the training data to the visualizer
----> 6 visualizer.score(X_test, y_test)  # Evaluate the model on the test data
      7 g = visualizer.show()             # Draw/show/show the data

/usr/local/var/pyenv/versions/3.5.2/envs/yellowbrick/lib/python3.5/site-packages/yellowbrick/classifier.py in score(self, X, y, **kwargs)
    133         self.scores = map(lambda s: dict(zip(self.classes_, s)), self.scores[0:3])
    134         self.scores = dict(zip(keys, self.scores))
--> 135         return self.draw(y, y_pred)
    136 
    137     def draw(self, y, y_pred):

/usr/local/var/pyenv/versions/3.5.2/envs/yellowbrick/lib/python3.5/site-packages/yellowbrick/classifier.py in draw(self, y, y_pred)
    158         for column in range(len(self.matrix)+1):
    159             for row in range(len(self.classes_)):
--> 160                 self.ax.text(column,row,self.matrix[row][column],va='center',ha='center')
    161 
    162         fig = plt.imshow(self.matrix, interpolation='nearest', cmap=self.cmap, vmin=0, vmax=1)

IndexError: list index out of range

In [96]:
visualizer


Out[96]:
ClassificationReport(ax=<matplotlib.axes._subplots.AxesSubplot object at 0x10dedf7b8>,
           classes=None, model=None)

Note: There seems to be some sort of bug in the draw/ fit methods.

Let's try a naive bayes as the other didn't wrok


In [148]:
# Classifier Evaluation Imports

from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

from yellowbrick.classifier import ClassificationReport, ClassBalance

In [149]:
features = ['petal_length', 'petal_width']
classes = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']

# Extract the numpy arrays from the data frame
X = iris[features].as_matrix()
y = iris.species_num.as_matrix()
assert y.shape[0] == X.shape[0]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
assert y_train.shape[0] == X_train.shape[0]
assert y_test.shape[0] == X_test.shape[0]

In [151]:
# Instantiate the classification model and visualizer
bayes = MultinomialNB()
visualizer = ClassificationReport(bayes)# classes=classes)

visualizer.fit(X_train, y_train)  # Fit the training data to the visualizer
visualizer.score(X_test, y_test)  # Evaluate the model on the test data
g = visualizer.show()             # Draw/show/show the data


/usr/local/var/pyenv/versions/3.5.2/envs/yellowbrick/lib/python3.5/site-packages/sklearn/metrics/classification.py:1113: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-151-bf3e22a6d078> in <module>()
      4 
      5 visualizer.fit(X_train, y_train)  # Fit the training data to the visualizer
----> 6 visualizer.score(X_test, y_test)  # Evaluate the model on the test data
      7 g = visualizer.show()             # Draw/show/show the data

/usr/local/var/pyenv/versions/3.5.2/envs/yellowbrick/lib/python3.5/site-packages/yellowbrick/classifier.py in score(self, X, y, **kwargs)
    133         self.scores = map(lambda s: dict(zip(self.classes_, s)), self.scores[0:3])
    134         self.scores = dict(zip(keys, self.scores))
--> 135         return self.draw(y, y_pred)
    136 
    137     def draw(self, y, y_pred):

/usr/local/var/pyenv/versions/3.5.2/envs/yellowbrick/lib/python3.5/site-packages/yellowbrick/classifier.py in draw(self, y, y_pred)
    158         for column in range(len(self.matrix)+1):
    159             for row in range(len(self.classes_)):
--> 160                 self.ax.text(column,row,self.matrix[row][column],va='center',ha='center')
    161 
    162         fig = plt.imshow(self.matrix, interpolation='nearest', cmap=self.cmap, vmin=0, vmax=1)

IndexError: list index out of range

In [ ]:

Model Selection: Random Forest Classification


In [108]:
features = ['petal_length', 'petal_width']
classes = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']

# Extract the numpy arrays from the data frame
X = iris[features].as_matrix()
y = iris.species_num.as_matrix()
assert y.shape[0] == X.shape[0]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
assert y_train.shape[0] == X_train.shape[0]
assert y_test.shape[0] == X_test.shape[0]

In [115]:
test = pd.DataFrame(y_test, columns=['species'])

In [131]:
test.species.value_counts() # The test train split provides unbalanced classes


Out[131]:
1    13
0    11
2     6
Name: species, dtype: int64

In [132]:
from sklearn.ensemble import RandomForestClassifier
from yellowbrick.classifier import ClassificationReport

# Instantiate the classification model and visualizer
forest = RandomForestClassifier()
visualizer = ClassBalance(forest, classes=classes)

visualizer.fit(X_train, y_train)  # Fit the training data to the visualizer
visualizer.score(X_test, y_test)  # Evaluate the model on the test data
g = visualizer.show()             # Draw/show/show the data


Note: The visualization works, but the example provides an unbalanced training/ testing set that provides a misleading for a Randomforest classifier. A more powerful example would train with balanced classes.


In [ ]: